Credit Card Users Churn Prediction

Exploratory Data Analysis and Insights

Problem definition, questions to be answered

Introduction

Background and Context

Thera Bank recently saw a steep decline in the number of its credit card users. Credit cards are a good source of income for banks because of the various fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

Customers leaving the credit card service would lead to a loss for the bank, so the bank wants to analyze customer data to identify which customers are likely to leave and why, so that it can improve in those areas.

As a Data Scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards. You need to identify the best possible model that will give the required performance.

Objective

  1. Explore and visualize the dataset.
  2. Build a classification model to predict whether a customer is going to churn.
  3. Optimize the model using appropriate techniques.
  4. Generate a set of insights and recommendations that will help the bank.

Data Dictionary

Overview and Descriptive Statistics

Importing of Libraries

Loading of the Dataset

View the first 5 rows of the dataset.

Check data types and number of non-null values for each column.

Check the duplicate count
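These inspection steps can be sketched with pandas; a tiny in-memory CSV stands in for the actual file (the real path and full column set are assumptions):

```python
import pandas as pd
from io import StringIO

# Stand-in for the real dataset file; replace StringIO with the CSV path
csv_data = StringIO(
    "CLIENTNUM,Customer_Age,Attrition_Flag\n"
    "708082083,45,Existing Customer\n"
    "708083283,49,Attrited Customer\n"
)
df = pd.read_csv(csv_data)

print(df.head())              # first 5 rows
df.info()                     # data types and non-null counts per column
print(df.duplicated().sum())  # number of fully duplicated rows
```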

Fixing the data types

Summary of the dataset

Number of unique values in each column

Number of observations in each category

Univariate Analysis (Numerical)

Customer_Age

Months_on_book

Credit_Limit

Total_Revolving_Bal

Avg_Open_To_Buy

Total_Amt_Chng_Q4_Q1

Total_Trans_Amt

Total_Trans_Ct

Total_Ct_Chng_Q4_Q1

Avg_Utilization_Ratio

Univariate Analysis (Categorical)

Dependent_count

Total_Relationship_Count

Months_Inactive_12_mon

Contacts_Count_12_mon

Gender

Education_Level

Marital_Status

Income_Category

Card_Category

Bivariate Analysis (Numerical)

Attrition_Flag vs Numerical variables

It is difficult to draw interpretations from the graphs above. Let's visualize them again with the outliers removed (for visualization only, not from the original data) to get a better understanding.
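One way to do this is an IQR-based filter applied only to the series being plotted, leaving the original data untouched; the helper name below is illustrative:

```python
import pandas as pd

def clip_outliers_for_plot(s: pd.Series) -> pd.Series:
    """Return a copy with values outside the 1.5*IQR whiskers removed (plotting only)."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return s[(s >= lo) & (s <= hi)]

# Example: a skewed series with one extreme value
s = pd.Series([1, 2, 3, 4, 5, 100])
print(clip_outliers_for_plot(s).tolist())  # [1, 2, 3, 4, 5]
```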

Bivariate Analysis (Categorical)

Encoding of the Target Variable
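A minimal sketch of the target encoding, assuming the two label values used in the standard version of this dataset:

```python
import pandas as pd

df = pd.DataFrame({"Attrition_Flag": ["Existing Customer", "Attrited Customer"]})
# 1 = churned customer, 0 = retained; the label strings are assumptions
df["Attrition_Flag"] = df["Attrition_Flag"].map(
    {"Existing Customer": 0, "Attrited Customer": 1}
)
print(df["Attrition_Flag"].tolist())  # [0, 1]
```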

Attrition_Flag vs Dependent_count

Attrition_Flag vs Total_Relationship_Count

Attrition_Flag vs Months_Inactive_12_mon

Attrition_Flag vs Contacts_Count_12_mon

Attrition_Flag vs Gender

Attrition_Flag vs Education_Level

Attrition_Flag vs Marital_Status

Attrition_Flag vs Income_Category

Attrition_Flag vs Card_Category

Multivariate Analysis

Attrition_Flag vs Total_Relationship_Count vs Total_Trans_Ct

Attrition_Flag vs Months_Inactive_12_mon vs Total_Trans_Ct

Attrition_Flag vs Months_Inactive_12_mon vs Total_Revolving_Bal

Attrition_Flag vs Card_Category vs Customer_Age

Income_Category vs Card_Category vs Customer_Age

Summary of EDA

Bivariate

Multivariate

General

Data pre-processing

Preparing the Data for Analysis

Checking Duplicates and Dropping the ID

There are no duplicates in the data

Dropping the ID

Log Scaling Skewed variables
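Log scaling can be sketched with `np.log1p`, which handles zeros safely; the column shown is illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Total_Trans_Amt": [510, 1088, 42000]})
# log1p(x) = log(1 + x); compresses the long right tail of skewed amounts
df["Total_Trans_Amt_log"] = np.log1p(df["Total_Trans_Amt"])
print(df["Total_Trans_Amt_log"].round(2).tolist())
```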

Feature Engineering

Avg_Trans_Amt

Avg_Open_Ratio

Rev_Credit_Ratio
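The three engineered features above can be sketched as simple ratios; these exact definitions are assumptions inferred from the feature names:

```python
import pandas as pd

df = pd.DataFrame({
    "Total_Trans_Amt": [1000.0],
    "Total_Trans_Ct": [40],
    "Avg_Open_To_Buy": [8000.0],
    "Credit_Limit": [10000.0],
    "Total_Revolving_Bal": [2000.0],
})
# Assumed definitions: average spend per transaction, open-to-buy share
# of the limit, and revolving balance share of the limit
df["Avg_Trans_Amt"] = df["Total_Trans_Amt"] / df["Total_Trans_Ct"]
df["Avg_Open_Ratio"] = df["Avg_Open_To_Buy"] / df["Credit_Limit"]
df["Rev_Credit_Ratio"] = df["Total_Revolving_Bal"] / df["Credit_Limit"]
print(df[["Avg_Trans_Amt", "Avg_Open_Ratio", "Rev_Credit_Ratio"]].iloc[0].tolist())
# [25.0, 0.8, 0.2]
```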

Log Scaling Skewed variables

Updated Correlation Plot

Dropping the Redundant Columns

Correlation Plot after deleting redundant variables

Fixing Observed Data Bias in Months_Inactive_12_mon and Contacts_Count_12_mon

Months_Inactive_12_mon

Contacts_Count_12_mon

Income_Category

Encoding of Variables

Income_Category

Education_Level

Marital_Status

Summary of Feature Engineering

Train - Test Split
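A stratified split preserves the churn ratio in both partitions; a minimal sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)  # imbalanced target, like the churn flag

# stratify=y keeps the class proportions identical in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(len(X_train), len(X_test))  # 7 3
```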

Missing Value Treatment

KNN Imputation
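A minimal sketch with scikit-learn's `KNNImputer`, which fills each missing entry with the mean of that column over the k nearest rows (distance computed on the shared non-missing features):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [np.nan, 8.0]])

# Each NaN is replaced by the mean of the 2 nearest neighbours' values
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```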

Rectification of the Label Encoding

The numerical labels will be inverse-transformed back into their categorical values

Income_Category

Education_Level

Marital_Status

Outlier Treatment

Display the percentage of outliers
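The outlier percentage per column can be computed with the usual 1.5×IQR whisker rule; the helper name below is illustrative:

```python
import pandas as pd

def pct_outliers(s: pd.Series) -> float:
    """Percentage of values lying outside the 1.5*IQR whiskers."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
    return 100 * mask.mean()

s = pd.Series([1, 2, 3, 4, 5, 100, 200, 3, 2, 1])
print(f"{pct_outliers(s):.1f}%")  # 20.0%
```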

Preparing the Data for Modelling

Dummy Variables Creation
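Dummy encoding can be sketched with `pd.get_dummies`; `drop_first=True` drops one level per variable to avoid perfect multicollinearity:

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["M", "F", "M"],
    "Card_Category": ["Blue", "Gold", "Blue"],
})
# One indicator column per remaining category level
dummies = pd.get_dummies(df, drop_first=True)
print(list(dummies.columns))  # ['Gender_M', 'Card_Category_Gold']
```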

Model building

Introduction

Model evaluation criterion:

The model can make two kinds of wrong predictions:

  1. Predicting a customer will attrit and does not - Loss of resources
  2. Predicting a customer will not attrit but does - Loss of opportunity

Which case is more important?

How can we reduce this loss, i.e., reduce the number of false negatives?
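Since a false negative is a churner the bank never tries to retain, recall is the metric to maximise; a small illustration:

```python
from sklearn.metrics import recall_score

y_true = [1, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0]  # one churner missed -> one false negative

# recall = TP / (TP + FN); here 2 of 3 churners are caught
print(recall_score(y_true, y_pred))
```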

Stratified K-Folds cross-validation

Training Performance

Summary (Base Models)

Model building - Oversampled data

Oversampling train data using SMOTE

Stratified K-Folds cross-validation

Training Performance

Summary (Oversampled Models)

Model building - Undersampled data

Undersampling train data using Random Under Sampler

Stratified K-Folds cross-validation

Training Performance

Summary (Undersampled Models)

Hyperparameter tuning using random search

Comparing Model Performance

So far, 18 models have been built. This section compares the performance of all the models to choose the best three, which will then undergo hyperparameter tuning.

Summary (All Models)

Hyperparameter Tuning

We will tune the XGBoost, Random Forest, and Gradient Boost oversampled models using RandomizedSearchCV. We will also compare the performance of, and time taken by, these three methods.
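The tuning pattern can be sketched as below, using scikit-learn's `GradientBoostingClassifier` as a stand-in so the snippet has no XGBoost dependency; the search space and synthetic data are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Small imbalanced toy problem standing in for the oversampled train set
X, y = make_classification(n_samples=200, weights=[0.85], random_state=1)

# Illustrative search space; the real grid is a modelling choice
param_dist = {
    "n_estimators": [50, 100, 150],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.05, 0.1, 0.2],
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=1),
    param_distributions=param_dist,
    n_iter=5,
    scoring="recall",  # optimise recall to cut false negatives
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=1),
    random_state=1,
)
search.fit(X, y)
print(search.best_params_)
```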

First, let's create two functions, one to calculate the different metrics and one to plot the confusion matrix, so that we don't have to repeat the same code for each model.

Gradient Boost

Random Forest Classifier

XGBoost

Model Performances

Comparing best three tuned Models

Best Model Performance on the test set

Productionize the model

Pipelines for productionizing the model

Column Transformer
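A sketch of the production pipeline: a `ColumnTransformer` routes numeric and categorical columns through separate preprocessing, wrapped in a `Pipeline` with a placeholder classifier (`LogisticRegression` stands in for the final tuned model, and the toy columns are assumptions):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "Customer_Age": [45, 49, 51, 40],
    "Gender": ["M", "F", "M", "F"],
    "churn": [0, 1, 0, 1],
})
X, y = df.drop(columns="churn"), df["churn"]

# Numeric columns are scaled; categoricals are one-hot encoded
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Customer_Age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(X, y)
print(model.predict(X))
```

Bundling preprocessing and the classifier in one object means the identical transformations are applied at inference time, which is the point of productionizing with pipelines.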

Actionable Insights & Recommendations